To find the best players, five variables were taken into account: the number of games a player started, field goals, total rebounds, assists, and total points for the season. The number of games started is important because it shows how highly the player is regarded. Field goals (the `FG` column) are also important because they count every shot, both 2-pointers and 3-pointers, that the player made. Total rebounds are essential because rebounding is an important factor in determining possession of the ball, and so is likely to matter in future games. The number of assists is critical because it counts how often the player made the pass that led directly to a basket. Lastly, the number of points a player scored over the season is important because it shows how often the player was able to execute a play and finish the shot.
This study will determine which professional players are high-performing athletes on low salaries. To ensure that this determination is accurate, the data will first be normalized. Then the variables will be selected and run through the k-means algorithm, with the number of clusters determined by the NbClust package. Finally, the clusters will be plotted and the results validated.
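The `normalize()` helper called in the code that follows is not defined in this section. A minimal sketch of what it might look like, assuming it performs min-max scaling to the range [0, 1] (the function body and the toy `games_started` values are assumptions, not the original definition):

```r
# Hypothetical min-max helper; the original normalize() definition is not shown in this section
normalize <- function(x) {
  rng <- range(x, na.rm = TRUE)
  # Guard against constant columns, which would otherwise divide by zero
  if (rng[1] == rng[2]) return(rep(0, length(x)))
  (x - rng[1]) / (rng[2] - rng[1])
}

# Toy values standing in for a games-started column
games_started <- c(0, 18, 36, 72)
normalized_gs <- normalize(games_started)
print(normalized_gs)
```

Rescaling every variable to a comparable range matters for k-means because the algorithm uses Euclidean distance, so a variable with a larger numeric range would otherwise dominate the clustering.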
####Merging the Datasets
# converting the data into data frame format
nba <- as.data.frame(nba)
nba_sal <- as.data.frame(nba_sal)
nba <- merge(nba, nba_sal)
# normalize the columns before clustering
# GS, TRB, AST and PTS go through normalize(); FG is z-scored with scale()
nba$GS <- normalize(nba$GS)
# as.numeric() drops the one-column matrix that scale() returns
nba$FG <- as.numeric(scale(nba$FG, center = TRUE, scale = TRUE))
nba$TRB <- normalize(nba$TRB)
nba$AST <- normalize(nba$AST)
nba$PTS <- normalize(nba$PTS)
# Subsetting the data with the selected variables
clust_data <- nba[, c("GS", "FG", "TRB", "AST", "PTS")]
View(clust_data)
View(nba)
# Run k-means with 2 centers and make the results reproducible with set.seed
set.seed(1)
kmeans_obj_nba = kmeans(clust_data, centers = 2, algorithm = "Lloyd")
head(kmeans_obj_nba)
## $cluster
## [1] 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 2 2 1 1 1 1 2 1 2 1 1 2 1 1 2
## [38] 2 1 2 2 1 1 1 1 2 1 1 1 1 1 2 1 2 1 1 2 2 1 2 1 1 1 2 2 1 1 1 2 1 1 2 1 1
## [75] 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 2 2 1 1 2 1 1 1 1 2 1 1 2 1 2 1 1 2 1 1 1
## [112] 1 1 1 1 1 1 1 1 2 1 1 2 2 2 2 1 2 1 1 2 2 1 1 1 1 1 1 2 2 1 2 2 2 1 2 1 1
## [149] 1 1 2 1 1 1 1 2 1 1 2 1 1 1 2 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
## [186] 2 1 1 1 1 1 1 2 1 1 1 2 1 2 2 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 2 1 1 2 1 1 1
## [223] 2 2 1 2 2 1 2 2 2 1 1 1 1 2 1 1 2 1 2 2 1 1 1 1 2 1 2 2 2 1 2 1 1 1 1 2 2
## [260] 1 1 1 1 2 1 1 2 2 2 2 2 1 1 2 1 1 1 1 2 2 2 2 2 1 2 1 1 1 2 2 1 1 1 1 1 1
## [297] 1 2 2 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 2 2 1 1 2 1 2 1 1 1 1 1 1 1 1 2 2 1
## [334] 2 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1
## [371] 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 2 2 2 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1
## [408] 1 1 2 2 1 2 1 1 1 2 1 2 1 1 1 1 2 1 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 1 1 1 1
## [445] 1 1 1 1 1 2 1 1 1 2 1 2
##
## $centers
## GS FG TRB AST PTS
## 1 0.1646724 -0.525990 0.1476800 0.09762642 0.1393395
## 2 0.7061573 1.277404 0.3925738 0.34335840 0.5222029
##
## $totss
## [1] 562.043
##
## $withinss
## [1] 99.39642 103.48878
##
## $tot.withinss
## [1] 202.8852
##
## $betweenss
## [1] 359.1578
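One quick way to read the sums of squares above is the share of total variance that lies between the clusters. The sketch below is not part of the original analysis; it copies the printed values into plain variables (the names `totss`, `betweenss`, and `tot_withinss` mirror the k-means output fields) so that it runs on its own:

```r
# Values copied from the k-means output above
totss        <- 562.043
betweenss    <- 359.1578
tot_withinss <- 202.8852

# Share of the total sum of squares explained by the cluster assignment
explained <- betweenss / totss
cat(sprintf("Between-cluster share of total variance: %.1f%%\n", 100 * explained))
```

With the fitted object in memory, the same ratio would be `kmeans_obj_nba$betweenss / kmeans_obj_nba$totss`. A higher share means the two clusters are better separated relative to the spread inside each cluster.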
####Correlation between the Games Started and Points
# This plot shows the correlation between the games started and the points each player has
sal_clusters = as.factor(kmeans_obj_nba$cluster)
b <- ggplot(nba, aes(x = GS, y = PTS, shape = sal_clusters, color = "2020-21", text = Player)) +
  geom_point(size = 6) +
  ggtitle("Games Started vs. Points for NBA Basketball players") +
  xlab("Number of Games Started") +
  ylab("Number of Points") +
  scale_shape_manual(name = "Cluster", labels = c("Cluster 1", "Cluster 2"), values = c("1", "2")) +
  theme_light()
ggplotly(b, tooltip="text")
####Correlation between the Games Started and Total Rebounds
sal_clusters = as.factor(kmeans_obj_nba$cluster)
c <- ggplot(nba, aes(x = GS, y = TRB, shape = sal_clusters, color = "2020-21", text = Player)) +
  geom_point(size = 6) +
  ggtitle("Games Started vs. Total Rebounds for NBA Basketball players") +
  xlab("Number of Games Started") +
  ylab("Number of Total Rebounds") +
  scale_shape_manual(name = "Cluster", labels = c("Cluster 1", "Cluster 2"), values = c("1", "2")) +
  theme_light()
ggplotly(c, tooltip="text")
####Correlation between the Games Started and Field Goals
sal_clusters = as.factor(kmeans_obj_nba$cluster)
a <- ggplot(nba, aes(x = GS, y = FG, shape = sal_clusters, color = "2020-21", text = Player)) +
  geom_point(size = 6) +
  ggtitle("Games Started vs. Field Goals for NBA Basketball players") +
  xlab("Number of Games Started") +
  ylab("Number of Field Goals") +
  scale_shape_manual(name = "Cluster", labels = c("Cluster 1", "Cluster 2"), values = c("1", "2")) +
  theme_light()
ggplotly(a, tooltip="text")
# using the NbClust algorithm to find the ideal number of clusters
(nbclust_obj_nba = NbClust(data = clust_data, method= "kmeans"))
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 8 proposed 2 as the best number of clusters
## * 6 proposed 3 as the best number of clusters
## * 3 proposed 4 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
## * 2 proposed 9 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
## * 1 proposed 14 as the best number of clusters
## * 2 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
## $All.index
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW
## 2 3.0631 803.6941 378.8952 6.9172 2758.852 112045.62 3423.1353 202.8852
## 3 2.2640 924.6234 222.2366 14.8767 3136.610 110103.39 893.7890 110.5900
## 4 2.3614 990.7091 122.0779 17.6022 3444.050 99740.56 300.5948 74.1922
## 5 1.9726 972.0767 77.4164 17.9587 3681.684 92548.54 206.5897 58.4152
## 6 1.1996 924.5795 67.0902 17.5206 3962.401 72007.44 149.0231 49.8570
## 7 20.5719 894.5432 26.4921 17.4593 4087.379 74514.71 109.3953 43.3882
## 8 0.0451 813.9592 66.2986 15.9640 4192.734 77247.73 93.1312 40.9709
## 9 75.5983 824.0572 18.6549 18.5210 4359.232 67860.60 68.7249 35.6893
## 10 0.0489 763.4260 29.5501 17.6752 4436.504 70719.51 63.8859 34.2595
## 11 1.5107 733.9124 23.9469 17.7246 4583.319 62015.10 55.2905 32.1307
## 12 1.6571 703.6891 9.5826 17.5989 4630.069 66611.65 48.9329 30.4899
## 13 0.4260 658.2826 24.9826 16.7159 4650.373 74771.69 47.0832 29.8458
## 14 0.2977 642.3814 55.0067 16.9179 4744.486 70546.12 42.3623 28.2525
## 15 7.4899 673.1335 17.2746 19.0006 4890.092 58847.00 30.5451 25.1256
## Friedman Rubin Cindex DB Silhouette Duda Pseudot2 Beale Ratkowsky
## 2 253.1478 3.3181 0.2167 0.7359 0.5699 0.7740 115.0449 0.9110 0.5031
## 3 282.4617 6.0873 0.2184 0.7535 0.5139 1.0477 -14.9314 -0.1419 0.4689
## 4 307.5183 9.0736 0.2242 0.8201 0.4716 1.3500 -45.6268 -0.8070 0.4176
## 5 314.6520 11.5243 0.1958 0.9159 0.4412 0.9844 1.1435 0.0489 0.3831
## 6 321.6205 13.5025 0.1970 0.9357 0.4452 1.7746 -37.1015 -1.3358 0.3581
## 7 323.5198 15.5156 0.1991 0.9791 0.4348 2.0499 -92.7042 -1.5917 0.3334
## 8 326.5670 16.4310 0.1902 1.0554 0.4180 0.9502 10.1228 0.1630 0.3128
## 9 334.6519 18.8626 0.1673 1.0195 0.4090 1.0931 -4.0034 -0.2583 0.2967
## 10 340.7736 19.6498 0.1628 1.0640 0.3987 1.9732 -73.4876 -1.5303 0.2826
## 11 343.5793 20.9517 0.1542 1.0698 0.3969 0.8568 23.7239 0.5184 0.2704
## 12 346.1570 22.0792 0.1493 1.0644 0.3625 2.3743 -63.6704 -1.7787 0.2595
## 13 349.5595 22.5557 0.1485 1.0936 0.3614 1.9891 -39.2844 -1.5295 0.2500
## 14 358.4058 23.8277 0.1421 1.1314 0.3525 2.4180 -48.0875 -1.8027 0.2408
## 15 369.1553 26.7931 0.1895 1.1076 0.3408 2.5750 -42.2039 -1.8813 0.2343
## Ball Ptbiserial Frey McClain Dunn Hubert SDindex Dindex SDbw
## 2 101.4426 0.6764 1.2708 0.2556 0.0208 0.0023 7.7445 0.5796 0.8518
## 3 36.8633 0.6497 1.5072 0.4279 0.0319 0.0028 5.6287 0.4358 0.4579
## 4 18.5480 0.5845 1.3564 0.6085 0.0295 0.0030 5.0875 0.3584 0.4513
## 5 11.6830 0.5243 0.6694 0.7960 0.0145 0.0032 5.7745 0.3065 0.3784
## 6 8.3095 0.5117 0.5801 0.8310 0.0160 0.0032 6.5039 0.2880 0.2871
## 7 6.1983 0.5028 1.0068 0.8484 0.0175 0.0032 6.0277 0.2715 0.1598
## 8 5.1214 0.4911 1.4461 0.8832 0.0250 0.0032 7.6111 0.2639 0.2649
## 9 3.9655 0.4369 0.6860 1.0943 0.0272 0.0033 7.6024 0.2371 0.1020
## 10 3.4260 0.4321 0.6208 1.1081 0.0324 0.0033 9.1923 0.2319 0.0979
## 11 2.9210 0.4236 4.8170 1.1279 0.0324 0.0033 9.2106 0.2246 0.1011
## 12 2.5408 0.3707 4.2427 1.4869 0.0217 0.0034 11.9838 0.2123 0.1073
## 13 2.2958 0.3624 0.3658 1.5567 0.0177 0.0034 13.8035 0.2089 0.0907
## 14 2.0180 0.3582 0.5465 1.5593 0.0177 0.0034 13.5507 0.2046 0.0840
## 15 1.6750 0.3545 0.2330 1.5747 0.0242 0.0034 17.1298 0.1969 0.0644
##
## $All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2 0.7696 117.9564 0.4728
## 3 0.7546 106.6898 1.0000
## 4 0.7365 62.9768 1.0000
## 5 0.6374 40.9588 0.9985
## 6 0.5995 56.7775 1.0000
## 7 0.7172 71.3659 1.0000
## 8 0.7183 75.6915 0.9760
## 9 0.5502 38.4256 1.0000
## 10 0.7014 63.4302 1.0000
## 11 0.7007 60.6631 0.7625
## 12 0.6251 65.9670 1.0000
## 13 0.6315 46.1003 1.0000
## 14 0.6273 48.7193 1.0000
## 15 0.6315 40.2648 1.0000
##
## $Best.nc
## KL CH Hartigan CCC Scott Marriot TrCovW
## Number_clusters 9.0000 4.0000 3.0000 15.0000 3.0000 6.00 3.000
## Value_Index 75.5983 990.7091 156.6586 19.0006 377.7579 23048.36 2529.346
## TraceW Friedman Rubin Cindex DB Silhouette Duda
## Number_clusters 3.0000 3.0000 9.0000 14.0000 2.0000 2.0000 2.000
## Value_Index 55.8974 29.3138 -1.6444 0.1421 0.7359 0.5699 0.774
## PseudoT2 Beale Ratkowsky Ball PtBiserial Frey McClain
## Number_clusters 2.0000 2.000 2.0000 3.0000 2.0000 4.0000 2.0000
## Value_Index 115.0449 0.911 0.5031 64.5793 0.6764 1.3564 0.2556
## Dunn Hubert SDindex Dindex SDbw
## Number_clusters 10.0000 0 4.0000 0 15.0000
## Value_Index 0.0324 0 5.0875 0 0.0644
##
## $Best.partition
## [1] 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 2 2 1 1 1 1 2 1 2 1 1 2 1 1 2
## [38] 2 1 2 2 1 1 1 1 2 1 1 1 1 1 2 1 2 1 1 2 2 1 2 1 1 1 2 2 1 1 1 2 1 1 2 1 1
## [75] 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 2 2 1 1 2 1 1 1 1 2 1 1 2 1 2 1 1 2 1 1 1
## [112] 1 1 1 1 1 1 1 1 2 1 1 2 2 2 2 1 2 1 1 2 2 1 1 1 1 1 1 2 2 1 2 2 2 1 2 1 1
## [149] 1 1 2 1 1 1 1 2 1 1 2 1 1 1 2 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
## [186] 2 1 1 1 1 1 1 2 1 1 1 2 1 2 2 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 2 1 1 2 1 1 1
## [223] 2 2 1 2 2 1 2 2 2 1 1 1 1 2 1 1 2 1 2 2 1 1 1 1 2 1 2 2 2 1 2 1 1 1 1 2 2
## [260] 1 1 1 1 2 1 1 2 2 2 2 2 1 1 2 1 1 1 1 2 2 2 2 2 1 2 1 1 1 2 2 1 1 1 1 1 1
## [297] 1 2 2 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 2 2 1 1 2 1 2 1 1 1 1 1 1 1 1 2 2 1
## [334] 2 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1
## [371] 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 2 2 2 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1
## [408] 1 1 2 2 1 2 1 1 1 2 1 2 1 1 1 1 2 1 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 1 1 1 1
## [445] 1 1 1 1 1 2 1 1 1 2 1 2
# subset the first row from Best.nc and convert to a data frame
freq_k_nba = nbclust_obj_nba$Best.nc[1,]
freq_k_nba = data.frame(freq_k_nba)
# Plot the recommended number of clusters as a bar chart of votes
ggplot(freq_k_nba, aes(x = freq_k_nba)) +
  geom_bar() +
  scale_x_continuous(breaks = seq(0, 15, by = 1)) +
  scale_y_continuous(breaks = seq(0, 12, by = 1)) +
  labs(x = "Number of Clusters", y = "Number of Votes", title = "Cluster Analysis")
From the cluster analysis, the recommended number of clusters is 2: eight indices proposed 2 clusters, more than for any other value (six proposed 3).
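The majority-rule tally can be reproduced from the `Number_clusters` row of `$Best.nc` printed above. The sketch below copies those values by hand into a vector (the name `votes` is illustrative) and drops the two zeros, which belong to the graphical Hubert and D indices and cast no vote:

```r
# Number_clusters row copied from the $Best.nc output above
votes <- c(KL = 9, CH = 4, Hartigan = 3, CCC = 15, Scott = 3, Marriot = 6,
           TrCovW = 3, TraceW = 3, Friedman = 3, Rubin = 9, Cindex = 14,
           DB = 2, Silhouette = 2, Duda = 2, PseudoT2 = 2, Beale = 2,
           Ratkowsky = 2, Ball = 3, PtBiserial = 2, Frey = 4, McClain = 2,
           Dunn = 10, Hubert = 0, SDindex = 4, Dindex = 0, SDbw = 15)

# Hubert and Dindex are graphical methods and report 0, so they are excluded
votes <- votes[votes > 0]

# Count how many indices proposed each number of clusters
tally <- table(votes)
best_k <- as.integer(names(tally)[which.max(tally)])
print(tally)
print(best_k)
```

With the fitted object in memory, the same vector is available as `nbclust_obj_nba$Best.nc[1, ]`.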
I would recommend Michael Porter Jr., Norman Powell, and PJ Washington. These three players did well last season in terms of both the number of field goals they made and the number of points they accumulated. They are also not paid as well as other athletes, so they appear to be high-performing players on low salaries.
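A shortlist like this one could also be produced programmatically. The following is only a sketch under stated assumptions: the data frame, the player labels, the `Salary` column name, and the median cutoff are all made up for illustration, and it assumes cluster 2 is the high-performing group, as the `$centers` output suggests.

```r
# Illustrative data only: players, cluster labels, and salaries are made up
players <- data.frame(
  Player  = c("Player A", "Player B", "Player C", "Player D", "Player E"),
  cluster = c(2, 1, 2, 2, 1),
  PTS     = c(0.81, 0.22, 0.64, 0.58, 0.15),
  Salary  = c(32e6, 2e6, 3.5e6, 5e6, 18e6),
  stringsAsFactors = FALSE
)

# Assumption: cluster 2 is the high-performing group
high_cluster <- 2
# Assumption: "low salary" means at or below the median salary
salary_cutoff <- median(players$Salary)

# Keep high-performing players at or below the cutoff, highest scorers first
shortlist <- players[players$cluster == high_cluster & players$Salary <= salary_cutoff, ]
shortlist <- shortlist[order(-shortlist$PTS), ]
print(shortlist$Player)
```

The cutoff is a design choice; a quantile such as the 25th percentile would give a stricter definition of a low salary.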